Residual Value Iteration Algorithm Based on Function Approximation
CHEN Jianping1,2,3, HU Wen1,2,3, FU Qiming1,2,3,4
1. School of Electronic and Information Engineering, Suzhou University of Science and Technology, Suzhou 215009
2. Jiangsu Key Laboratory of Intelligent Building Energy Efficiency, Suzhou University of Science and Technology, Suzhou 215009
3. Suzhou Key Laboratory of Mobile Networking and Applied Technologies, Suzhou University of Science and Technology, Suzhou 215009
4. Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun 130012
Abstract: Aiming at the problems of unstable and slow convergence in the traditional value iteration algorithm, an improved residual value iteration algorithm based on function approximation is proposed. The traditional value iteration algorithm is combined with a value iteration algorithm based on the Bellman residual: a weight factor is introduced and a new update rule is constructed for the value function parameter vector. Theoretically, the new parameter update guarantees convergence of the algorithm and resolves the convergence instability of the traditional value iteration algorithm. Moreover, a forgetting factor is introduced to speed up convergence. Experimental results on the Grid World problem show that the proposed algorithm has good performance and robustness.
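The abstract describes the update rule only at a high level. Below is a minimal Python sketch of how a weighted combination of the direct value-iteration update and the Bellman-residual (residual-gradient) update, together with a forgetting factor applied to an accumulated update trace, might look on a toy Grid World. The weight factor `BETA`, forgetting factor `LAM`, one-hot features, and the grid layout are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

# Hedged sketch: residual value iteration with linear function approximation.
# BETA interpolates between the direct update (BETA = 0) and the pure
# residual-gradient update (BETA = 1); LAM decays (forgets) the update trace.
N = 4          # 4x4 grid, goal in the last cell
GAMMA = 0.9    # discount factor
ALPHA = 0.1    # step size
BETA = 0.5     # weight factor between direct and residual-gradient directions
LAM = 0.5      # forgetting factor on the accumulated update trace

def features(s):
    """One-hot state features (a simple linear approximator)."""
    phi = np.zeros(N * N)
    phi[s] = 1.0
    return phi

def step(s, a):
    """Deterministic Grid World transitions; reward 1 on reaching the goal."""
    r, c = divmod(s, N)
    dr, dc = [(-1, 0), (1, 0), (0, -1), (0, 1)][a]
    r2, c2 = min(max(r + dr, 0), N - 1), min(max(c + dc, 0), N - 1)
    s2 = r2 * N + c2
    return s2, (1.0 if s2 == N * N - 1 else 0.0)

theta = np.zeros(N * N)        # value-function parameter vector
trace = np.zeros_like(theta)   # decayed (forgotten) update trace

for sweep in range(200):
    for s in range(N * N - 1):            # skip the absorbing goal state
        # Bellman backup target: max over actions of r + gamma * V(s')
        backups = [(step(s, a), a) for a in range(4)]
        targets = [r + GAMMA * features(s2) @ theta for (s2, r), _ in backups]
        best = int(np.argmax(targets))
        (s2, r), _ = backups[best]

        delta = targets[best] - features(s) @ theta   # Bellman residual
        # Weighted combination of the direct direction phi(s) and the
        # residual-gradient direction phi(s) - gamma * phi(s').
        direction = features(s) - BETA * GAMMA * features(s2)
        trace = LAM * trace + delta * direction        # forget old contributions
        theta += ALPHA * trace

print(np.round(theta.reshape(N, N), 2))   # learned state values
```

With one-hot features the fixed point where all Bellman residuals vanish is the optimal value function, so both update directions are zero there; the weight and forgetting factors only change how the iterates approach it, which is the effect the paper studies.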
CHEN Jianping, HU Wen, FU Qiming. Residual Value Iteration Algorithm Based on Function Approximation. Pattern Recognition and Artificial Intelligence, 2017, 30(7): 663-672.